arXiv:1503.02335v1 [cs.CL] 8 Mar 2015


An Unsupervised Method for Uncovering Morphological Chains 


Karthik Narasimhan, Regina Barzilay and Tommi Jaakkola 

CSAIL, Massachusetts Institute of Technology

{karthikn, regina, tommi}@csail.mit.edu 


Abstract 

Most state-of-the-art systems today produce morphological analysis based only on orthographic patterns. In contrast, we propose a model for unsupervised morphological analysis that integrates orthographic and semantic views of words. We model word formation in terms of morphological chains, from base words to the observed words, breaking the chains into parent-child relations. We use log-linear models with morpheme and word-level features to predict possible parents, including their modifications, for each word. The limited set of candidate parents for each word renders contrastive estimation feasible. Our model consistently matches or outperforms five state-of-the-art systems on Arabic, English and Turkish.^1


1 Introduction 

Morphologically related words exhibit connections at multiple levels, ranging from orthographical patterns to semantic proximity. For instance, the words playing and played share the same stem, but also carry similar meaning. Ideally, all these complementary sources of information would be taken into account when learning morphological structures.


Most state-of-the-art unsupervised approaches to morphological analysis are built primarily around orthographic patterns in morphologically-related words (Goldwater and Johnson, 2004; Creutz and Lagus, 2007; Snyder and Barzilay, 2008; Poon et al., 2009). In these approaches, words are commonly modeled as concatenations of morphemes.


^1 Code is available at https://github.com/karthikncode/MorphoChain


This morpheme-centric view is well-suited for uncovering distributional properties of stems and affixes. But it is not well-equipped to capture semantic relatedness at the word level.

In contrast, earlier approaches that capture semantic similarity in morphological variants operate solely at the word level (Schone and Jurafsky, 2000; Baroni et al., 2002). Given two candidate words, the proximity is assessed using standard word-distributional measures such as mutual information. However, the fact that these models do not model morphemes directly greatly limits their performance.

In this paper, we propose a model to integrate orthographic and semantic views. Our goal is to build a chain of derivations for a current word from its base form. For instance, given a word playfully, the corresponding chain is play → playful → playfully. The word play is a base form of this derivation as it cannot be reduced any further. Individual derivations are obtained by adding a morpheme (e.g., -ful) to a parent word (e.g., play). This addition may be implemented via a simple concatenation, or it may involve transformations. At every step of the chain, the model aims to find a parent-child pair (e.g., play-playful) such that the parent also constitutes a valid entry in the lexicon. This allows the model to directly compare the semantic similarity of the parent-child pair, while also considering the orthographic properties of the morphemic combination.

We model each step of a morphological chain by means of a log-linear model that enables us to incorporate a wide range of features. At the semantic level, we consider the relatedness between two words using the corresponding vector embeddings. At the orthographic level, features capture whether the words in the chain actually occur in the corpus, how affixes are reused, as well as how the words are altered during the addition of morphemes. We use Contrastive Estimation (Smith and Eisner, 2005) to efficiently learn this model in an unsupervised manner. Specifically, we require that each word has greater support among its bounded set of candidate parents than an artificially constructed neighboring word would.

We evaluate our model on datasets in three languages: Arabic, English and Turkish. We compare our performance against five state-of-the-art unsupervised systems: Morfessor Baseline (Virpioja et al., 2013), Morfessor CatMAP (Creutz and Lagus, 2005), AGMorph (Sirts and Goldwater, 2013), the Lee Segmenter (Lee et al., 2011; Stallard et al., 2012) and the system of Poon et al. (2009). Our model consistently equals or outperforms these systems across the three languages. For instance, on English, we obtain an 8.5% gain in F-measure over Morfessor. Our experiments also demonstrate the value of semantic information. While the contribution varies from 3% on Turkish to 11% on the English dataset, it nevertheless improves performance across all the languages.

2 Related Work 

Currently, top performing unsupervised morphological analyzers are based on the orthographic properties of sub-word units (Creutz and Lagus, 2005; Creutz and Lagus, 2007; Poon et al., 2009; Sirts and Goldwater, 2013). Adding semantic information to these systems is not an easy task, as they operate at the level of individual morphemes, rather than morphologically related words.

The value of semantic information has been demonstrated in earlier work on morphological analysis. Schone and Jurafsky (2000) employ an LSA-based similarity measure to identify morphological variants from a list of orthographically close word pairs. The filtered pairs are then used to identify stems and affixes. Based on similar intuition, Baroni et al. (2002) design a method that integrates these sources of information, captured as two word pair lists, ranked based on edit distance and mutual information. These lists are subsequently combined using a deterministic weighting function.


In both of these algorithms, orthographic relatedness is based on simple deterministic rules. Therefore, semantic relatedness plays an essential role in the success of these methods. However, these algorithms do not capture distributional properties of morphemes that are critical to the success of current state-of-the-art algorithms. In contrast, we utilize a single statistical framework that seamlessly combines both sources of information. Moreover, it allows us to incorporate a wide range of additional features.

Our work also relates to the log-linear model for morphological segmentation developed by Poon et al. (2009). They propose a joint model over all words (observations) and their segmentations (hidden), using morphemes and their contexts (character n-grams) for the features. Since the space of all possible segmentation sets is huge, learning and inference are quite involved. They use techniques like Contrastive Estimation, sampling and simulated annealing. In contrast, our formulation does not result in such a large search space. For each word, the number of parent candidates is bounded by its length multiplied by the number of possible transformations. Therefore, Contrastive Estimation can be implemented via enumeration, and does not require sampling. Moreover, operating at the level of words (rather than morphemes) enables us to incorporate semantic and word-level features.

Most recently, work by Sirts and Goldwater (2013) uses Adaptor Grammars for minimally supervised segmentation. By defining a morphological grammar consisting of zero or more prefixes, stems and suffixes, they induce segmentations over words in both unsupervised and semi-supervised settings. While their model (AGMorph) builds up a word by combining morphemes in the form of a parse tree, we operate at the word level and build up the final word via intermediate words in the chain.

In other related work, Dreyer and Eisner (2011) tackle the problem of recovering morphological paradigms and inflectional principles. They use a Bayesian generative model with a log-linear framework, using expressive features, over pairs of strings. Their work, however, handles a different task from ours and requires a small amount of annotated data to seed the model.

In this work, we make use of semantic information to help morphological analysis. Lee et al. (2011) present a model that takes advantage of syntactic context to perform better morphological segmentation. Stallard et al. (2012) improve on this approach using the technique of Maximum Marginal decoding to reduce noise. Their best system considers entire sentences, while our approach (and the morphological analyzers described above) operates at the vocabulary level without regarding sentence context. Hence, though their work is not directly comparable to ours, it presents an interesting orthogonal view to the problem.


Segment    Cosine Similarity
p           0.095
pl         -0.037
pla        -0.041
play        0.580
playe       0.000
player      1.000

Table 1: Cosine similarities between word vectors of various segments of the word player and the vector of player.


3 Model 

3.1 Definitions and Framework 

We use morphological chains to model words in the language. A morphological chain is a short sequence of words that starts from the base word and ends up in a morphological variant. Each node in the chain is, by assumption, a valid word. We refer to the word that is morphologically changed as the parent word and its morphological variant as the child word. A word that does not have any morphological parent is a base word (e.g., words like play, chat, run).^2

Words in a chain (other than the base word) are created from their parents by adding morphemes (prefixes, suffixes, or other words). For example, a morphological chain that ends up in the word internationally could be nation → national → international → internationally. The base word for this chain is nation. Note that the same word can belong to multiple morphological chains. For example, the word national appears also as part of another chain that ends up in nationalize. These chains are treated separately but with shared statistical support for the common parts. For this reason, our model breaks morphological chains into possible parent-child relations such as (nation, national).

We use a log-linear model for predicting parent-child pairs. A log-linear model allows an easy, efficient way of incorporating several different features pertaining to parent-child relations. In our case, we leverage both orthographic and semantic patterns to encode representative features.

^2 We distinguish base words from morphological roots, which do not, strictly speaking, have to be valid words in the language.


A log-linear model consists of a set of features represented by a feature vector $\phi : W \times Z \to \mathbb{R}^d$ and a corresponding weight vector $\theta \in \mathbb{R}^d$. Here, W is a set of words and Z is the set of candidates for words in W, that includes the parents as well as their types. Specifically, a candidate is a (parent, type) pair, where the type variable keeps track of the type of morphological change (or the lack thereof if there is no parent) as we go from the parent to the child. In our experiments, Z is obtained by collecting together all sub-words created by splitting observed words in W at all different points. For instance, if we take the word cars, the candidates obtained by splitting would include (car, Suffix), (ca, Suffix), (c, Suffix), (ars, Prefix), (rs, Prefix) and (s, Prefix).

Note that the parent may undergo changes as it is joined with the affix and thus, there are more choices for the parent than just the ones obtained by splitting. Hence, to the set of candidates, we also add modified sub-words where transformations include character repetition (plan → planning), deletion (decide → deciding) or replacement (carry → carried).^3 Following the above example for the word cars, we get candidates like (cat, Modify) and (cart, Delete). Each word also has a stop candidate (-, Stop), which is equivalent to considering it as a base word with no parent.
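The candidate enumeration described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `candidates`, the lowercase Latin `alphabet` default, and the choice to apply transformations only at the parent's final character on the suffix side are assumptions. The half-length restriction on parents follows the footnote.

```python
import string

def candidates(word, alphabet=string.ascii_lowercase):
    """Enumerate hypothetical (parent, type) candidates for `word`."""
    n = len(word)
    cands = [("-", "Stop")]  # every word may be a base word

    def long_enough(parent):
        # Footnote restriction: parent is at least half as long as the word.
        return 2 * len(parent) >= n

    for i in range(1, n):
        head, tail = word[:i], word[i:]
        if long_enough(head):
            cands.append((head, "Suffix"))  # word = parent + suffix
            # Repeat: the parent's last character was doubled (plan -> plann|ing).
            if i >= 2 and head[-1] == head[-2] and long_enough(head[:-1]):
                cands.append((head[:-1], "Repeat"))
            for c in alphabet:
                # Delete: the parent lost its last character c (decide -> decid|ing).
                cands.append((head + c, "Delete"))
                # Modify: the parent's last character c was replaced (carry -> carri|ed).
                if c != head[-1]:
                    cands.append((head[:-1] + c, "Modify"))
        if long_enough(tail):
            cands.append((tail, "Prefix"))  # word = prefix + parent
    return cands
```

Because the half-length filter is applied, very short sub-words such as (s, Prefix) for cars are not produced by this sketch, even though they appear in the unrestricted enumeration in the text.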

Let us define the probability of a particular word-candidate pair $(w \in W, z \in Z)$ as $P(w, z) \propto e^{\theta \cdot \phi(w,z)}$. The conditional probability of a candidate


^3 We found that restricting the set of parents to sub-words that are at least half the length of the original word helped improve the performance of the system.










z given a word w is then

$$P(z \mid w) = \frac{e^{\theta \cdot \phi(w,z)}}{\sum_{z' \in C(w)} e^{\theta \cdot \phi(w,z')}}, \quad z \in C(w)$$

where $C(w) \subset Z$ refers to the set of possible candidates (parents and their types) for the word $w \in W$.
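As a small illustration of the conditional distribution over candidates, the sketch below normalizes exponentiated scores over C(w). The sparse dictionary representation of features and weights and the name `candidate_probs` are assumptions made for the example; subtracting the maximum score is a standard numerical safeguard and does not change the result.

```python
import math

def candidate_probs(theta, feats_by_cand):
    """P(z | w) for every candidate z of one word.

    theta:         {feature_name: weight}
    feats_by_cand: {candidate: {feature_name: value}}, covering C(w)
    """
    scores = {
        z: sum(theta.get(f, 0.0) * v for f, v in feats.items())
        for z, feats in feats_by_cand.items()
    }
    top = max(scores.values())  # shift for numerical stability
    exps = {z: math.exp(s - top) for z, s in scores.items()}
    total = sum(exps.values())
    return {z: e / total for z, e in exps.items()}
```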

In order to generate a possible ancestral chain for a word, we recursively predict parents until the word is predicted to be a base word. In our model, these choices are included in the set of candidates for the specific word, and their likelihood is controlled by the type related features. Details of these and other features are given in section 3.2.


3.2 Features 


This section provides an overview of the features used in our model. The features are defined for a given word w, parent p and type t (recall that a candidate $z \in Z$ is the pair (p, t)). For computing some of these features, we use an unannotated list of words with frequencies (details in section 4). Table 2 provides a summary of the features.


Semantic Similarity We hypothesize that morphologically related words exhibit semantic similarity. To this end, we introduce a feature that measures cosine similarity between the word vectors of the word (w) and the parent (p). These word vectors are learned from co-occurrence patterns from a large corpus^4 (see section 4 for details).

To validate this measure, we computed the cosine similarity between words and their morphological parents from the CELEX2 database (Baayen et al., 1995). On average, the resulting word-parent similarity score is 0.351, compared to 0.116 for randomly chosen word-parent combinations.^5
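The semantic similarity feature can be sketched as below. The function name `cosine_feature` and the plain-list vector representation are assumptions for illustration; the fallback value of -0.5 for strings without a learnt vector is the one stated in the paper's footnote, while applying the same fallback to all-zero vectors is an additional assumption.

```python
import math

def cosine_feature(word, parent, vectors, missing=-0.5):
    """Cosine similarity between the vectors of `word` and `parent`.

    vectors: {string: list of floats}; strings without a vector fall back
    to `missing` (-0.5, as in the footnote).
    """
    u, v = vectors.get(word), vectors.get(parent)
    if u is None or v is None:
        return missing
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return missing  # assumption: treat zero vectors like missing ones
    return dot / (norm_u * norm_v)
```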


Affixes A distinctive feature of affixes is their frequent occurrence in multiple words. To capture this pattern, we automatically generate a list of frequently occurring candidate affixes. These candidates are collected by considering the string difference between a word and its parent candidates which appear in the word list. For example, for the word paints, possible suffixes include -s derived from the

^4 For strings which do not have a vector learnt from the corpus, we set the cosine value to be -0.5.

^5 The cosine values range from around -0.1 to 0.7 usually.


Language   Top suffixes
English    -s, -'s, -d, -ed, -ing, -s', -ly, -er, -e
Turkish    -n, -i, -lar, -dir, -a, -den, -de, -in, -leri, -ler
Arabic     -p, -A, -F, -y, -t, -AF, -h, -hA, -yp, -At

Table 3: Top ten suffixes in automatically produced suffix lists.


parent paint, -ts from the parent pain and -ints from the word pa. Similarly, we compile a list of potential prefixes. These two lists are sorted by their frequency and thresholded. For each affix in the lists, we have a corresponding indicator variable. For unseen affixes, we use an UNK (unknown) indicator.

These automatically constructed lists act as a proxy for the gold affixes. In English, choosing the top 100 suffixes in this manner gives us 43 correct suffixes (compared against gold suffixes). Table 3 gives some examples of suffixes generated this way.
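The suffix list construction can be sketched in a few lines. This is an illustrative approximation: the name `top_suffixes`, counting each (word, parent) pair once, and ignoring transformations are all assumptions. A prefix list could be built symmetrically from word-final sub-words.

```python
from collections import Counter

def top_suffixes(wordlist, k=100):
    """Rank candidate suffixes by how many (word, parent) pairs yield them."""
    vocab = set(wordlist)
    counts = Counter()
    for word in vocab:
        for i in range(1, len(word)):
            # Count the string difference only if the parent is itself in the list.
            if word[:i] in vocab:
                counts["-" + word[i:]] += 1
    return [suffix for suffix, _ in counts.most_common(k)]
```

On a toy list containing pa, pain, paint and paints, the word paints contributes -s, -ts and -ints, matching the example in the text.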

Affix Correlation While the previous feature considers one affix assignment at a time, there is a known correlation between affixes attached to the same stem. For instance, in English, verbs that can be modified by the suffix -ing can also take the related suffix -ed. Therefore, we introduce a feature that measures whether, for a given affix and parent, we also observe in the wordlist the same parent modified by its related affix. For example, for the pair (walking, walk), the feature instance AffixCorr(ing, ed) is set to 1, because the word walked is in the wordlist.

To construct pairs of related affixes, we compute the correlation between pairs in the auto-generated affix list described previously. This correlation is proportional to the number of stems the two affixes share. For English, examples of such pairs include (inter-, re-), (under-, over-), (-ly, -s), and (-er, -ing).
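A possible way to compute shared-stem counts for suffix pairs, and to fire the AffixCorr indicator, is sketched below. The helper names `shared_stem_counts` and `affix_corr` are hypothetical, only suffixes are handled, and the raw shared-stem count stands in for the correlation, which the paper describes only as proportional to it.

```python
from collections import defaultdict
from itertools import combinations

def shared_stem_counts(wordlist, suffixes):
    """Number of in-vocabulary stems shared by each pair of suffixes."""
    vocab = set(wordlist)
    stems = defaultdict(set)
    for word in vocab:
        for s in suffixes:
            if word.endswith(s) and word[:-len(s)] in vocab:
                stems[s].add(word[:-len(s)])
    return {(a, b): len(stems[a] & stems[b])
            for a, b in combinations(sorted(suffixes), 2)}

def affix_corr(parent, related_suffix, vocab):
    """1 if the parent also occurs with the related suffix, else 0."""
    return 1 if parent + related_suffix in vocab else 0
```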

Presence in Wordlist We want to bias the model to select parents that constitute valid words.^6 Moreover, we would like to take into account the frequency of the parent words. We encode this information as the logarithm of their word counts in the wordlist (WordFreq). For parents not in the wordlist, we set a binary OutOfVocab feature to 1.

^6 This is not an absolute requirement in the model.












Feature type        Word (w)   Candidate (p,t)    Feature                      Value
Cosine              painter    (paint, Suffix)    w · p                        0.58
Affix               painter    (paint, Suffix)    suffix=er                    1
Affix Correlation   walking    (walk, Suffix)     AffixCorr(ing, ed)           1
Wordlist            painter    (paint, Suffix)    WordFreq                     8.73
                                                  OutOfVocab                   0
Transformations     planning   (plan, Repeat)     type=Repeat × chars=(n,-)    1
                    deciding   (decide, Delete)   type=Delete × chars=(e,-)    1
                    carried    (carry, Modify)    type=Modify × chars=(y,i)    1
Stop                painter    (-, Stop)          begin=pa                     1
                                                  end=er                       1
                                                  0.5 < MaxCos < 0.6           1
                                                  length=7                     1

Table 2: Example of various types of features used in the model. w and p are the word vectors for the word and parent, respectively.


Transformations We also support transformations to enable non-concatenative morphology. Even in English, which is mostly concatenative, such transformations are frequent. We consider three kinds of transformations previously considered in the literature (Goldwater and Johnson, 2004):

• repetition of the last character in the parent (e.g., plan → planning)

• deletion of the last character in the parent (e.g., decide → deciding)

• modification of the last character of the parent (e.g., carry → carried)

We add features that are the cartesian product of the type of transformation and the character(s) involved. For instance, for the parent-child pair (believe, believing), the feature type=Delete × chars=(e,-) will be activated, while the rest of the transformational features will be 0.
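The transformation features can be illustrated with a short sketch that maps a (parent, type) candidate and its child word to the name of the single indicator it activates. The function name and the feature-name strings (written with an ASCII "x") are assumptions patterned on Table 2, not the authors' code.

```python
def transformation_feature(parent, ttype, child):
    """Name of the transformation indicator fired by a candidate, or None."""
    last = parent[-1]
    if ttype == "Repeat":
        # The parent's last character is doubled before the suffix.
        return f"type=Repeat x chars=({last},-)"
    if ttype == "Delete":
        # The parent's last character is dropped before the suffix.
        return f"type=Delete x chars=({last},-)"
    if ttype == "Modify":
        # The parent's last character is replaced by the child's character
        # at the same position.
        return f"type=Modify x chars=({last},{child[len(parent) - 1]})"
    return None  # plain Prefix/Suffix/Stop candidates fire no such feature
```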

Stop Condition Finally, we introduce features that aim to identify base words which do not have a parent. The features include the length of the word, and the starting and ending character unigrams and bigrams. In addition, we include a feature that records the highest cosine similarity between the word and any of its candidate parents. This feature will help, for example, to identify paint as a base word, instead of choosing pain as its parent.


3.3 Learning 

We learn the log-linear model in an unsupervised manner without explicit feedback about correct morphological segmentations. We assume that we have an unannotated wordlist D for this purpose. A typical approach to learning such a model would be to maximize the likelihood of all the observed words in D over the space of all strings constructible in the alphabet, $\Sigma^*$, by marginalizing over the hidden candidates.^7 In other words, we could use the EM-algorithm to maximize


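$$L(\theta; D) = \prod_{w^* \in D} P(w^*) = \prod_{w^* \in D} \sum_{z \in C(w^*)} P(w^*, z) = \prod_{w^* \in D} \left[ \frac{\sum_{z \in C(w^*)} e^{\theta \cdot \phi(w^*,z)}}{\sum_{w \in \Sigma^*} \sum_{z \in C(w)} e^{\theta \cdot \phi(w,z)}} \right] \qquad (1)$$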


However, maximizing $L(\theta; D)$ is problematic since approximate methods would be needed to sum over $\Sigma^*$ in order to calculate the normalization term in (1). Moreover, we would like to encourage the model to emphasize relevant parent-child pairs^8 out of a smaller set of alternatives rather than those pertaining to all the words.

^7 We also tried maximizing instead of marginalizing, but the model gets stuck in one of the numerous local optima.

^8 In other words, assign higher probability mass.





















We employ Contrastive Estimation (Smith and Eisner, 2005) and replace the normalization term by a sum over the neighbors of each word. For each word in the language, we create neighboring strings in two sets. For the first set, we transpose a single pair of adjacent characters of the word. We perform this transposition over the first k or the last k characters of the word.^9 For the second set, we transpose two pairs of characters simultaneously - one from the first k characters and one from the last k.
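The neighborhood construction lends itself to a short sketch. The function names are hypothetical, and two details are assumptions: the two transposed pairs in the second set are required not to overlap, and the observed word itself is excluded from the returned set (a caller can add it back if the neighborhood is meant to contain it).

```python
def transpose(word, i):
    """Swap the adjacent characters at positions i and i + 1."""
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def neighbors(word, k=5):
    """Neighboring strings built from transpositions near both word ends."""
    n = len(word)
    front = range(0, min(k, n) - 1)     # pairs inside the first k characters
    back = range(max(n - k, 0), n - 1)  # pairs inside the last k characters
    result = set()
    # First set: a single transposition at either end.
    for i in set(front) | set(back):
        result.add(transpose(word, i))
    # Second set: one transposition at each end, applied together.
    for i in front:
        for j in back:
            if j >= i + 2:  # assumption: the two pairs must not overlap
                result.add(transpose(transpose(word, i), j))
    result.discard(word)  # swapping identical characters changes nothing
    return result
```

With k=2, the word painting yields apinting, paintign and apintign.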

The combined set of artificially constructed words represents the events that we wish to move probability mass away from in favor of the actually observed words. The neighbors facilitate the learning of good weights for the affix features by providing the required contrast (at both ends of the words) to the actual words in the vocabulary. A remaining concern is that the model may not distinguish any arbitrary substring from a good suffix/prefix. For example, -ng appears in all the words that end with -ing, and could be considered a valid suffix. We include other features to help make this distinction. Specifically, we include features such as word vector similarity and the presence of the parent in the observed wordlist. For example, in the word painting, the parent candidate paint is likely to occur and also has a high cosine similarity with painting in terms of their word vectors. In contrast, painti does not.

Given the list of words and their neighborhoods, 
we define the contrastive likelihood as follows: 


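$$L_{CE}(\theta; D) = \prod_{w^* \in D} \left[ \frac{\sum_{z \in C(w^*)} e^{\theta \cdot \phi(w^*,z)}}{\sum_{w \in N(w^*)} \sum_{z \in C(w)} e^{\theta \cdot \phi(w,z)}} \right] \qquad (2)$$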


where $N(w^*)$ is the neighborhood of $w^*$. This likelihood is much easier to evaluate and optimize.

After adding in a standard regularization term, we 
maximize the following log likelihood objective: 


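$$L_{CE}(\theta; D) = \sum_{w^* \in D} \left[ \log \sum_{z \in C(w^*)} e^{\theta \cdot \phi(w^*,z)} - \log \sum_{w \in N(w^*)} \sum_{z \in C(w)} e^{\theta \cdot \phi(w,z)} \right] - \lambda \|\theta\|^2 \qquad (3)$$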

^9 The performance increases with increasing k until k = 5, after which no gains were observed.


The corresponding gradient can be derived as: 


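$$\frac{\partial L_{CE}(\theta; D)}{\partial \theta_j} = \sum_{w^* \in D} \left[ \frac{\sum_{z \in C(w^*)} \phi_j(w^*,z) \, e^{\theta \cdot \phi(w^*,z)}}{\sum_{z \in C(w^*)} e^{\theta \cdot \phi(w^*,z)}} - \frac{\sum_{w \in N(w^*)} \sum_{z \in C(w)} \phi_j(w,z) \, e^{\theta \cdot \phi(w,z)}}{\sum_{w \in N(w^*)} \sum_{z \in C(w)} e^{\theta \cdot \phi(w,z)}} \right] - 2\lambda\theta_j \qquad (4)$$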


We use LBFGS-B (Byrd et al., 1995) to optimize $L_{CE}(\theta; D)$ with gradients given above.
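The regularized contrastive objective and its gradient can be computed together by enumeration, as in the sketch below. The data layout (lists of sparse feature dictionaries per observed word and per neighborhood), the function names and the default `lam` are assumptions for illustration; no overflow protection is included, and the result would still need to be handed to an optimizer such as L-BFGS-B.

```python
import math

def score(theta, feats):
    """Dot product between sparse weights and a sparse feature vector."""
    return sum(theta.get(f, 0.0) * v for f, v in feats.items())

def ce_objective_and_grad(theta, data, lam=1.0):
    """Regularized contrastive log-likelihood and its gradient.

    data: list of (observed, neighborhood) pairs; `observed` holds one feature
    dict per candidate of w*, `neighborhood` one per (w, z) with w in N(w*).
    """
    value = -lam * sum(t * t for t in theta.values())
    grad = {f: -2.0 * lam * t for f, t in theta.items()}
    for observed, neighborhood in data:
        # The numerator group enters with +, the neighborhood group with -.
        for sign, group in ((1.0, observed), (-1.0, neighborhood)):
            exps = [math.exp(score(theta, fv)) for fv in group]
            total = sum(exps)
            value += sign * math.log(total)
            for e, fv in zip(exps, group):
                for f, v in fv.items():
                    # Expected feature value under the group's distribution.
                    grad[f] = grad.get(f, 0.0) + sign * v * e / total
    return value, grad
```

A finite-difference comparison on a toy problem is a cheap way to check that the gradient matches the objective before plugging both into an optimizer.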


3.4 Prediction 

Given a test word, we predict a morphological chain 
in a greedy step by step fashion. In each step, we use 
the learnt weights to predict the best parent for the 
current word (from the set of candidates), or choose 
to stop and declare the current word as a base word if 
the stop case has the highest score. Once we have the 
chain, we can derive a morphological segmentation 
by inserting a segmentation point (into the test word) 
appropriately for each edge in the chain. 

Algorithms 1 and 2 provide details on the prediction procedure. In both algorithms, type refers to the type of modification (or lack of) that the parent undergoes: Prefix/Suffix addition, types of transformation like repetition, deletion, modification, or the Stop case.


Algorithm 1 Procedure to predict a parent for a word

procedure PREDICT(word)
    candidates ← CANDIDATES(word)
    bestScore ← 0
    bestCand ← (-, STOP)
    for cand ∈ candidates do
        features ← FEATURES(word, cand)
        score ← MODELSCORE(features)
        if score > bestScore then
            bestScore ← score
            bestCand ← cand
    return bestCand






















Algorithm 2 Procedure to predict a morphological chain

procedure GETCHAIN(word)
    candidate ← PREDICT(word)
    parent, type ← candidate
    if type = STOP then
        return [(word, STOP)]
    return GETCHAIN(parent) + [(parent, type)]
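The two procedures translate almost directly into Python. In this sketch the candidate generator and the scoring function are passed in as arguments (hypothetical `candidates` and `score` callables), and, following Algorithm 1, the best score starts at 0, which assumes scores are non-negative (for example, probabilities P(z | w)).

```python
def predict(word, candidates, score):
    """Greedy choice of the best (parent, type) candidate for `word`."""
    best_cand, best_score = ("-", "Stop"), 0.0
    for cand in candidates(word):
        s = score(word, cand)
        if s > best_score:
            best_cand, best_score = cand, s
    return best_cand

def get_chain(word, candidates, score):
    """Recursively follow predicted parents until a base word is reached."""
    parent, ctype = predict(word, candidates, score)
    if ctype == "Stop":
        return [(word, "Stop")]
    return get_chain(parent, candidates, score) + [(parent, ctype)]
```

With a toy candidate table for playfully, the chain comes back as the base word play followed by the edges through play and playful, mirroring the play → playful → playfully example.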


Lang      Train (# words)    Test (# words)     WordVec (# words)
English   MC-10 (878K)       MC-05:10 (2218)    Wikipedia (129M)
Turkish   MC-10 (617K)       MC-05:10 (2534)    BOUN (361M)
Arabic    Gigaword (3.83M)   ATB (119K)         Gigaword (1.22G)

Table 4: Data corpora and statistics. MC-10 = MorphoChallenge 2010^10, MC-05:10 = MorphoChallenges 2005-10 (aggregated), BOUN = BOUN corpus (Sak et al., 2008), Gigaword = Arabic Gigaword corpus (Parker et al., 2011), ATB = Arabic Treebank (Maamouri et al., 2003).


4 Experimental Setup 


Data We run experiments on three different languages: English, Turkish and Arabic. For each language, we utilize corpora for training, testing and learning word vectors. The training data consists of an unannotated wordlist with frequency information, while the test data is a set of gold morphological segmentations. For the word vectors, we train the word2vec tool (Mikolov et al., 2013) on large text corpora and obtain 200-dimensional vectors for all three languages. Table 4 provides information about each dataset.


Evaluation measure We test our model on the task of morphological segmentation. We evaluate performance on individual segmentation points, which is standard for this task (Virpioja et al., 2011). We compare predicted segmentations against the gold test data for each language and report overall Precision, Recall and F-1 scores calculated across all segmentation points in the data. As is common in unsupervised segmentation (Poon et al., 2009; Sirts and Goldwater, 2013), we included the test words (without their segmentations) with the training words during parameter learning.

Baselines We compare our model with five other systems: Morfessor Baseline (Morf-Base), Morfessor CatMap (Morf-Cat), AGMorph, the Lee Segmenter and the system of Poon et al. (2009). Morfessor has achieved excellent performance on the MorphoChallenge dataset, and is widely used for performing unsupervised morphological analysis on various languages, even in fairly recent work (Luong et al., 2013). In our experiments, we employ two variants of the system because their relative performance varies across languages. We use publicly available implementations of these variants (Virpioja et al., 2013; Creutz and Lagus, 2005). We perform several runs with various parameters, and choose the run with the best performance on each language.

We evaluate AGMorph by directly obtaining the posterior grammars from the authors.^11 We show results for the Compounding grammar, which we find has the best average performance over the languages. The Lee Segmenter (Lee et al., 2011), improved upon by using Maximum Marginal decoding in Stallard et al. (2012), has achieved excellent performance on the Arabic (ATB) dataset. We perform comparison experiments with the model 2 (M2) of the segmenter, which employs latent POS tags, and does not require sentence context, which is not available for other languages in the dataset. We obtained the code for the system, and ran it on our English and Turkish datasets.^12 We do not have access to an implementation of Poon et al.'s system; hence, we directly report scores from their paper on the ATB dataset and test our model on the same data.

^10 http://research.ics.aalto.fi/events/morphochallenge/


5 Results 

Table 5 details the performance of the various models on the segmentation task. We can see that our method outperforms both variants of Morfessor,

^11 The grammars were trained using data we provided to them.

^12 We report numbers on Arabic directly from their paper.












































Figure 1: Model performance vs data size obtained 
by frequency thresholding 

with an absolute gain of 8.5%, 5.1% and 5% in F-score on English, Turkish and Arabic, respectively. On Arabic, we obtain a 2.2% absolute improvement over Poon et al.'s model. AGMorph doesn't segment better than Morfessor on English and Arabic but does very well on Turkish (60.9% F1 compared to our model's 61.2%). This could be due to the fact that the Compounding grammar is well suited to the agglutinative morphology in Turkish and hence provides more gains than for English and Arabic. The Lee Segmenter (M2) performs the best on Arabic (82% F1), but lags behind on English and Turkish. This result is consistent with the fact that the system was optimized for Arabic.

The table also demonstrates the importance of the added semantic information in our model. For all three languages, having the features that utilize cosine similarity provides a significant boost in performance. We also see that the transformation features and affix correlation features play a role in improving the results, although a less important one.

Next, we study the effect of data quality on the prediction of the algorithm. A training set often contains misspellings, abbreviations and truncated words. Thresholding based on frequency is commonly used to reduce this noise. Figure 1 shows the performance of the algorithm as a function of the data size obtained at various degrees of thresholding. We note that the performance of the model on all three languages remains quite stable from about


Lang      Method        Prec    Recall   F-1
English   Morf-Base     0.740   0.623    0.677
          Morf-Cat      0.673   0.587    0.627
          AGMorph       0.696   0.604    0.647
          Lee (M2)      0.825   0.525    0.642
          Model -C      0.555   0.792    0.653
          Model -T      0.831   0.664    0.738
          Model -A      0.810   0.713    0.758
          Full model    0.807   0.722    0.762
Turkish   Morf-Base     0.827   0.362    0.504
          Morf-Cat      0.522   0.607    0.561
          AGMorph       0.878   0.466    0.609
          Lee (M2)      0.787   0.355    0.489
          Model -C      0.516   0.652    0.576
          Model -T      0.665   0.521    0.584
          Model -A      0.668   0.543    0.599
          Full model    0.743   0.520    0.612
Arabic    Morf-Base     0.807   0.204    0.326
          Morf-Cat      0.774   0.726    0.749
          AGMorph       0.672   0.761    0.713
          Poon et al.   0.885   0.692    0.777
          Lee (M2)      -       -        0.820
          Model -C      0.626   0.912    0.743
          Model -T      0.774   0.807    0.790
          Model -A      0.775   0.808    0.791
          Full model    0.770   0.831    0.799

Table 5: Results on unsupervised morphological segmentation; scores are calculated across all segmentation points in the test data. Baselines are in italics. -C=without cosine features, -T=without transformation features, -A=without affix correlation features. Numbers on Arabic for Poon et al. and Lee (M2) are reported directly from their papers.
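Scoring over individual segmentation points, as used for the numbers in Table 5, can be sketched as follows. The function names are hypothetical, segmentations are assumed to be hyphen-delimited strings with exactly one gold analysis per word, and any handling of multiple gold analyses in the actual evaluation is not modeled here.

```python
def boundaries(segmentation):
    """Character offsets of the segmentation points in e.g. 'steril-iz-ing'."""
    points, pos = set(), 0
    for morpheme in segmentation.split("-")[:-1]:
        pos += len(morpheme)
        points.add(pos)
    return points

def precision_recall_f1(predicted, gold):
    """Micro-averaged scores over all segmentation points in the test set."""
    tp = fp = fn = 0
    for p, g in zip(predicted, gold):
        bp, bg = boundaries(p), boundaries(g)
        tp += len(bp & bg)  # predicted points that are also in the gold analysis
        fp += len(bp - bg)  # spurious points (over-segmentation)
        fn += len(bg - bp)  # missed points (under-segmentation)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```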

1000 to 10000 training words, after which the deviations are more apparent. The plot also demonstrates that the model works well even with a small amount of quality data (≈3000 most frequent words).

Error analysis We look at a random subset of 50 incorrectly segmented words^13 in the model's output for each language. Table 7 gives a breakup of errors in all 3 languages due to over or under-segmentation. Table 6 provides examples of correct and incorrect segmentations predicted by our model.

^13 Words with at least one segmentation point incorrect.




















Language   Correct Segmentations                 Incorrect Segmentations
           Word             Segmentation         Word             Predicted          Correct
English    salvoes          salvo-es             contempt         con-tempt          contempt
           negotiations     negotiat-ion-s       sterilizing      steriliz-ing       steril-iz-ing
           telephotograph   tele-photo-graph     desolating       desolating         desolat-ing
           unequivocally    un-equivocal-ly      storerooms       storeroom-s        store-room-s
           carsickness's    car-sick-ness-'s     tattlers         tattler-s          tattl-er-s
Turkish    moderni          modern-i             mektuplaşmalar   mektuplaşma-lar    mektup-laş-ma-lar
           teknolojideki    teknoloji-de-ki      gelecektiniz     gelecek-tiniz      gel-ecek-ti-niz
           burasıydı        bura-sı-ydı          aynalardan       ayna-lar-da-n      ayna-lar-dan
           çizgisine        çiz-gi-si-ne         uyuduğunuzu      uyudu-ğu-nuzu      uyu-duğ-unuz-u
           değişiklikte     değişik-lik-te       dirseğe          dirseğe            dirseğ-e
Arabic     sy$Ark           s-y-$Ark             wryfAldw         w-ry-fAldw         w-ryfAldw
           nyqwsyA          nyqwsyA              bHlwlhA          b-Hlwl-h-A         b-Hlwl-hA
           AlmTrwHp         Al-mTrwH-p           jnwby            jnwb-y             jnwby
           ytEAmlwA         y-tEAml-wA           wbAyrn           w-bAyr-n           w-bAyrn
           lAtnZr           lA-t-nZr             rknyp            rknyp              rkny-p

Table 6: Examples of correct and incorrect segmentations produced by our model on the three languages. Correct segmentations are taken directly from gold MorphoChallenge data.


Lang      Over-segment    Under-segment
English   10%             86%
Turkish   12%             78%
Arabic    60%             40%


Table 7: Types of errors in analysis of 50 randomly 
sampled incorrect segmentations for each language. 
The remaining errors are due to incorrect placement 
of segmentation points. 
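
One way to make the three error categories concrete is to compare the sets of segmentation points in the predicted and gold analyses. The sketch below is an assumed operationalization, not a procedure stated in the text: a prediction counts as over-segmented when its points strictly contain the gold points, under-segmented in the reverse case, and misplaced otherwise. The function names are hypothetical.

```python
def boundaries(segmentation: str) -> set[int]:
    """Character offsets of the segmentation points in a hyphen-delimited analysis."""
    points, offset = set(), 0
    for morpheme in segmentation.split("-")[:-1]:
        offset += len(morpheme)
        points.add(offset)
    return points


def error_type(predicted: str, gold: str) -> str:
    """Classify a predicted segmentation against the gold one (assumed definitions)."""
    pb, gb = boundaries(predicted), boundaries(gold)
    if pb == gb:
        return "correct"
    if pb > gb:    # every gold point found, plus extra points
        return "over-segment"
    if pb < gb:    # no spurious points, but some gold points missed
        return "under-segment"
    return "misplaced"  # neither set contains the other


# The first three pairs come from the examples table; "wal-king" is a made-up case.
labels = [
    error_type("con-tempt", "contempt"),
    error_type("steriliz-ing", "steril-iz-ing"),
    error_type("w-bAyr-n", "w-bAyrn"),
    error_type("wal-king", "walk-ing"),
]
```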

In English, most errors are due to under-segmentation
of a word. We find that around 60% of
errors are due to roots that undergo transformations
while morphing into a variant (see Table 6 for
examples). Errors in Turkish are also mostly due to
under-segmentation. On further investigation, we find that
most such errors (58% of the 78%) are due to parent
words either not in vocabulary or having a very low
word count (< 10). In contrast, we observe a
majority of over-segmentation errors in Arabic (60%).
This is likely because of Arabic having more
single-character affixes than the other languages. We
find that 56% of errors in Arabic involve a
single-character affix, which is much higher than the 24.6%
that involve a two-letter affix. In contrast, 25% of
errors in English are due to single-character affixes,
around the same number as the 24% of errors due to
two-letter affixes.

Since our model is an unsupervised one, we make
several simplifying assumptions to keep the candidate
set size manageable for learning. For instance,
we do not explicitly model infixes, since we select
parent candidates by only modifying the ends of a
word. Also, the root-template morphology of Arabic,
a Semitic language, presents a complexity we
do not directly handle. For instance, words in Arabic
can be formed using specific patterns (known
as binyanim) (e.g., nZr → yntZr). However, on
going through the errors, we find that only 14%
are due to these binyanim patterns not being
captured.¹⁴ Adding in transformation rules to capture
these types of language-specific patterns can help
increase both chain and segmentation accuracy.
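
To see why end-only candidate generation cannot recover infixed forms, consider the minimal sketch below. It enumerates parents obtained by stripping a prefix or a suffix and deliberately ignores transformations of the parent; the function name `end_candidates` and the `min_parent_len` cutoff are assumptions made for illustration, not details of the model.

```python
def end_candidates(word: str, min_parent_len: int = 2) -> list[tuple[str, str, str]]:
    """Enumerate (parent, affix, kind) triples formed by removing one end of `word`."""
    candidates = []
    # Strip a suffix: the parent is a proper prefix of the word.
    for i in range(min_parent_len, len(word)):
        candidates.append((word[:i], word[i:], "suffix"))
    # Strip a prefix: the parent is a proper suffix of the word.
    for i in range(1, len(word) - min_parent_len + 1):
        candidates.append((word[i:], word[:i], "prefix"))
    return candidates


english_parents = {parent for parent, _, _ in end_candidates("playing")}
arabic_parents = {parent for parent, _, _ in end_candidates("yntZr")}
```

Under these assumptions `play` is among the candidates for `playing`, while `nZr` never appears among the candidates for `yntZr`, because the inserted `t` sits inside the root rather than at either end.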

¹⁴This might be due to the fact that the gold segmentations
also do not capture such patterns. For example, the gold
segmentation for yntZrwn is given as y-ntZr-wn, even though ntZr
is not a valid root.

Figure 2: Comparison of gold and predicted frequency
distributions of the top 15 affixes for English

Analysis of learned distributions To investigate
how decisive the learnt model is, we examine the
final probability distribution P(z|w) of parent
candidates for the words in the English wordlist. We
observe that the probability of the best candidate
(max_z P(z|w)), averaged over all words, is 0.77.
We also find that the average entropy of the
distributions is 0.65, which is quite low considering that
the average number of candidates is 10.76 per word,
which would result in a max possible entropy of
around 2.37 if the distributions were uniform. This
demonstrates that the model tends to prefer a single
parent for every word,¹⁵ which is exactly the
behavior we want.
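
The entropy comparison can be checked with a few lines of code. The sketch below assumes natural logarithms, which is consistent with the quoted bound since ln 10.76 ≈ 2.376; the example distribution is made up for illustration and is not taken from the model's output.

```python
import math


def entropy(probs: list[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def max_entropy(num_candidates: float) -> float:
    """Entropy of a uniform distribution over `num_candidates` outcomes."""
    return math.log(num_candidates)


# Upper bound for the average candidate set size reported in the text.
bound = max_entropy(10.76)

# A made-up peaked distribution over 10 candidates versus the uniform one.
peaked = [0.91] + [0.01] * 9
uniform = [0.1] * 10
```

A distribution that concentrates its mass on one parent, such as `peaked`, has far lower entropy than `uniform`, which is the sense in which a low average entropy indicates a decisive model.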

Affix analysis We also analyze the various affixes
produced by the model, and compare with the gold
affixes. Particularly, we plot the frequency
distributions of the affixes¹⁶ obtained from the gold and
predicted segmentations for the English test data in
Figure 2.

¹⁵Note that the candidate probability distribution may have
more than a single peak in some cases.

¹⁶To conserve space, we only show the distribution of
suffixes here, but we observe a similar trend for prefixes.

From the figure, we can see that our model learns
to identify good affixes for the given language.
Several of the top affixes predicted are also present in
the gold list, and we also observe similarities in the
frequency distributions.
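
A frequency distribution of suffixes can be tabulated from a list of segmentations roughly as follows. This is only a sketch: the text does not say how stems are told apart from affixes, so the code assumes that the longest morpheme (the first one in case of ties) is the stem and that everything after it is a suffix. The names `suffixes` and `top_suffixes` are hypothetical.

```python
from collections import Counter


def suffixes(segmentation: str) -> list[str]:
    """Morphemes following the stem, taking the longest morpheme as the stem (assumption)."""
    morphemes = segmentation.split("-")
    stem_index = max(range(len(morphemes)), key=lambda i: len(morphemes[i]))
    return morphemes[stem_index + 1:]


def top_suffixes(segmentations: list[str], k: int = 15) -> list[tuple[str, float]]:
    """Relative frequencies of the k most common suffixes."""
    counts = Counter(s for seg in segmentations for s in suffixes(seg))
    total = sum(counts.values())
    if total == 0:
        return []
    return [(suffix, count / total) for suffix, count in counts.most_common(k)]


# Segmentations taken from the examples table.
distribution = top_suffixes(["salvo-es", "negotiat-ion-s", "storeroom-s", "tattler-s"])
```

Running the same function on gold and on predicted segmentations yields two distributions that can be compared side by side.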

6 Conclusion 

In this work, we have proposed a discriminative
model for unsupervised morphological segmentation
that seamlessly integrates orthographic and
semantic properties of words. We use morphological
chains to model the word formation process
and show how to employ the flexibility of log-linear
models to incorporate both morpheme and word-level
features, while handling transformations of
parent words. Our model consistently equals or
outperforms five state-of-the-art systems on Arabic,
English and Turkish. Future directions of work
include using better neighborhood functions for
contrastive estimation, exploring other views of the data
that could be incorporated, examining better
prediction schemes and employing morphological chains
in other applications in NLP.